fix: improve message delivery reliability (Telegram + Feishu) #266

Open
lowmiaq-gmail wants to merge 2 commits into op7418:main from lowmiaq-gmail:fix/message-reliability

@lowmiaq-gmail

Summary

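  • Telegram notifications now retry with exponential backoff (previously fire-and-forget: a single failure lost the message permanently)
  • Feishu resource downloads (images, files, audio/video) now retry: up to 3 attempts with exponential backoff
  • Feishu inbound message dedup is persisted to the channel_offsets table (dedup state survives a restart)
  • Feishu in-memory dedup cap raised from 1000 to 5000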

Root Cause

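Notification sending and resource downloads had no retry logic at all, so transient network failures (timeouts, dropped connections, 429 rate limiting) caused messages to be lost silently.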

Changes

src/lib/telegram-bot.ts

  • Add callWithRetry(): retry with exponential backoff; 4xx responses (other than 429) are not retried
  • sendMessage() now goes through callWithRetry
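
To make the retry rule above concrete, here is a minimal sketch of a helper in the spirit of callWithRetry(). It is not the PR's code: the `status` field on thrown errors, the 500 ms base delay, and the injectable `sleep` and `random` parameters are assumptions made for illustration.

```typescript
// Reads a numeric HTTP status from a thrown value, if it has one.
// The `status` field is an assumed error shape, not the real client's.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const s = (err as { status: unknown }).status;
    return typeof s === "number" ? s : undefined;
  }
  return undefined;
}

// 4xx responses are permanent, except 429 (rate limit).
// Errors without a status (timeouts, dropped connections) count as transient.
function isRetryable(err: unknown): boolean {
  const s = statusOf(err);
  if (s === undefined) return true;
  return s === 429 || !(s >= 400 && s < 500);
}

// Exponential backoff with jitter: a random delay in [0, baseMs * 2^attempt).
function backoffMs(
  attempt: number,
  baseMs: number = 500,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * 2 ** attempt);
}

// Calls fn, retrying up to maxRetries times on retryable errors.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```

With `maxRetries = 2` the wrapped call runs at most three times, which matches the "up to 3 attempts" wording in the summary. Passing `sleep` as a parameter keeps the helper testable without real delays.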

src/lib/bridge/adapters/feishu-adapter.ts

  • downloadResource() is wrapped in a retry loop (an oversized file returns immediately rather than wasting retries)
  • addToDedup() persists to the channel_offsets table
  • DEDUP_MAX raised from 1000 to 5000
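
The dedup change can be pictured with the sketch below: a capped in-memory set plus a small persisted marker. The `OffsetStore` interface is a hypothetical stand-in for the channel_offsets table, and the `MessageDedup` class name and oldest-first eviction policy are assumptions, not the adapter's actual code.

```typescript
// Hypothetical key/value view of the channel_offsets table.
interface OffsetStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

class MessageDedup {
  // A Set iterates in insertion order, so its first entry is the oldest.
  private readonly seen = new Set<string>();
  private readonly store: OffsetStore;
  private readonly key: string;
  private readonly max: number;

  constructor(store: OffsetStore, key: string, max: number = 5000) {
    this.store = store;
    this.key = key;
    this.max = max;
    // Restore the last processed message id so it stays deduped after a restart.
    const last = store.get(key);
    if (last !== undefined) this.seen.add(last);
  }

  // Returns true if the id is new (process it), false if it is a duplicate.
  add(messageId: string): boolean {
    if (this.seen.has(messageId)) return false;
    this.seen.add(messageId);
    if (this.seen.size > this.max) {
      // Evict the oldest id to keep memory bounded.
      const oldest = this.seen.values().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    this.store.set(this.key, messageId);
    return true;
  }
}
```

Persisting only the last processed id, as the commit message describes, protects against redelivery of the most recent message after a restart; older ids are covered only while the process stays up.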

Test plan

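  • Telegram notifications recover automatically via retry after a network failure
  • Feishu image download succeeds on retry after a failure
  • Feishu messages are not processed twice after restarting CodePilot
  • TypeScript compiles cleanly ✅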

🤖 Generated with Claude Code

Telegram notifications:
- Add retry with exponential backoff to sendMessage() (was fire-and-forget)
- Skip retry for 4xx errors (except 429 rate limit)
- Max 2 retries with jittered backoff

Feishu adapter:
- Add retry (max 2) with exponential backoff to resource downloads
- Size-limit failures skip retry (not transient)
- Persist last processed message_id to channel_offsets table (survives restart)
- Increase in-memory dedup cap from 1000 to 5000

Root cause: notifications and resource downloads had zero retry logic,
causing silent message loss on transient network failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel

vercel bot commented Mar 13, 2026

Someone is attempting to deploy a commit to the op7418's projects Team on Vercel.

A member of the Team first needs to authorize it.

When bridge_parallel_tasks setting is enabled and the current session
is busy processing a message, new incoming messages automatically spawn
ephemeral worker sessions instead of queueing behind the active task.

This allows users to send multiple independent tasks via Feishu/Telegram
and have them processed concurrently with separate Claude streams,
eliminating the sequential bottleneck.

- channel-router: add createWorkerBinding() for ephemeral worker sessions
- bridge-manager: detect busy sessions and dispatch to workers
- handleMessage: accept optional binding override for worker routing
- Worker sessions inherit model, provider, working dir, mode from parent
- Backward compatible: disabled by default (opt-in via setting)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
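
The dispatch rule described in this commit can be sketched roughly as follows. Only the name createWorkerBinding comes from the commit message; its signature here, the `Binding` fields, the `pickBinding` helper, and the worker id format are hypothetical and chosen for illustration.

```typescript
// Assumed shape of a channel-to-session binding.
interface Binding {
  sessionId: string;
  model: string;
  provider: string;
  workingDir: string;
  mode: string;
  ephemeral: boolean;
}

let workerCounter = 0;

// Creates an ephemeral worker that inherits model, provider,
// working dir, and mode from the parent binding.
function createWorkerBinding(parent: Binding): Binding {
  workerCounter += 1;
  return {
    ...parent,
    sessionId: `${parent.sessionId}-worker-${workerCounter}`,
    ephemeral: true,
  };
}

// Chooses where an incoming message runs. With the setting off
// (the default) messages always go to the parent session and queue as before.
function pickBinding(
  parent: Binding,
  busySessions: ReadonlySet<string>,
  parallelEnabled: boolean,
): Binding {
  if (parallelEnabled && busySessions.has(parent.sessionId)) {
    return createWorkerBinding(parent);
  }
  return parent;
}
```

Returning the parent binding unchanged when the setting is off is what keeps the feature opt-in: no worker session is ever created unless bridge_parallel_tasks is enabled and the parent is busy.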
op7418 added a commit that referenced this pull request Apr 14, 2026
…esources, thread session

Five community-reported Feishu bridge issues fixed together, audited against
the current channels/feishu/* architecture (not the stale bridge/adapters/feishu-adapter.ts).

#321 — thread session bleed:
- inbound.ts: threadSession config was loaded but never checked. Thread address
  encoding now guarded by config.threadSession, so the setting actually works.

#384 — requireMention group filtering:
- inbound.ts: parse mentions[] and drop un-mentioned group messages when the
  setting is on. Strip the @bot placeholder from text so the LLM sees clean input.
- index.ts: resolve bot open_id via getBotInfo() with retry after gateway.start().
- Fail-open during the bot identity startup gap — dropping every group message
  while identity resolves would look like a broken bot. Once identity resolves,
  the gate activates. If resolution fails entirely, logs clearly warn that
  requireMention is inactive.
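
A rough sketch of the fail-open mention gate described above. The `Mention` and `GroupMessage` shapes and the `applyMentionGate` name are assumptions for illustration; the actual inbound.ts logic may differ.

```typescript
// Assumed mention shape: `key` is the placeholder that appears in the text.
interface Mention {
  openId: string;
  key: string;
}

interface GroupMessage {
  text: string;
  mentions: Mention[];
}

// Returns the text to forward to the model, or null to drop the message.
function applyMentionGate(
  msg: GroupMessage,
  botOpenId: string | undefined,
  requireMention: boolean,
): string | null {
  if (!requireMention) return msg.text;
  // Fail-open: until the bot identity resolves, let every message through.
  if (botOpenId === undefined) return msg.text;
  const hit = msg.mentions.find((m) => m.openId === botOpenId);
  if (hit === undefined) return null; // group message that does not mention the bot
  // Strip the @bot placeholder so the model sees clean input.
  return msg.text.split(hit.key).join("").trim();
}
```

During the startup gap the text passes through unstripped, since the bot's placeholder cannot be identified without its open_id.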

#282 — AskUserQuestion interactive card:
- permission-broker.ts: remove blanket deny blacklist. Build ask:{requestId}:{idx}
  card with option buttons on channels that support them; store the questions
  payload in the suggestions field so the callback can echo them back.
- handleAskUserQuestionCallback: resolves with updatedInput = { questions, answers }
  matching the native tool's expected shape.
- QQ/Weixin (no button support): deny with clear reason rather than falling back
  to Allow/Deny, which would execute the tool with empty answers and produce
  "The user did not provide any answers." Caller can retry as plain text.
- full_access auto-approve still skips AskUserQuestion — the user's choice
  carries semantic meaning beyond permission consent.
- outbound.ts: indigo "Question" card header for ask: callbacks.
- bridge-manager.ts: route ask: callbacks before perm: callbacks, skip the
  "Permission response recorded" confirmation (the model's reply is the answer).

#291 — file/image/audio/video support:
- New resource-downloader.ts: downloads via im.messageResource.get with 20MB
  limit, 2-retry exponential backoff, permanent-error detection (not-found /
  permission → don't retry).
- inbound.ts: extract resource metadata (file_key, type, caption) for non-text
  message types into PendingResource[]. New parseMessageWithResources exposes
  both the base message and pending downloads to the caller.
- index.ts: downloadAndEnqueue runs the downloads async so the gateway handler
  doesn't block long uploads; partial failures still enqueue with the text we have.
- bridge-manager.ts: type-aware fallback prompt for attachment-only turns —
  replaces hardcoded "Describe this image" with per-type prompts (image/audio/
  video/mixed) so non-image attachments don't mislead the model.

#266 — outbound delivery reliability:
- outbound.ts: sendMessage wrapped with exponential-backoff retry (2 retries).
  isTransientError() skips 4xx permanent errors (invalid app_id, missing
  scope, etc.) and retries on timeouts/network/5xx/rate-limit.

Docs:
- issue-tracker.md: update B-012 (multi-adapter) to "需重新诊断" ("needs re-diagnosis") based on code
  audit finding no obvious isolation break. Add B-017 tracking Feishu WSClient
  stability cluster (#323 #288 #199 #149 #148) that needs user logs.

All 953 unit tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
op7418 added a commit that referenced this pull request Apr 14, 2026
Add sections 22-30 to bridge-system.md covering the new/fixed behavior shipped
in v0.50.0:

- §22 Feishu one-click app registration (device flow, session state machine,
  slow_down backoff, Lark failover, error_code contract, cancel semantics,
  AbortController + run-id race protection)
- §23 Feishu authorization enforcement (dmPolicy/groupPolicy/allowFrom/
  groupAllowFrom — previously dead config; thread-session address stripping)
- §24 Feishu bot identity resolution (fail-open startup window, 3-attempt
  quick retry, 60s background retry, generation guard against stale probes)
- §25 Feishu WSClient force close (no more ghost connections after restart)
- §26 Global bridge stop aborts active tasks (matches /stop semantics)
- §27 AskUserQuestion interactive card (ask: callback, channel capability
  branching, strict validation rejecting multi-question/multi-select,
  Feishu indigo Question header)
- §28 Feishu resource messages (image/file/audio/video, resource-downloader
  with 20MB limit + retry, type-aware fallback prompt, binary-safe history
  replay via isTextLikeMime)
- §29 Feishu outbound retry (#266)
- §30 Feishu thread-session guard (#321)

Also update directory listing, API route table, and `feishu/` module
descriptions to reflect the new files (feishu-app-registration.ts,
resource-downloader.ts) and changed modules.

Fact-checked against current code — line numbers and constants match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>